Pattern Matching Algorithms
Presentation · April 2017
DOI: 10.13140/RG.2.2.27925.63200
1 author:
Kamran Mahmoudi
Imam Khomeini International University
All content following this page was uploaded by Kamran Mahmoudi on 05 July 2018.
Pattern matching algorithms
Presentation by: Kamran Mahmoudi [[email protected]]
Under the supervision of Dr. Mahdavi
Imam Khomeini International University, April 2017
Pattern matching in Bioinformatics
Certain known nucleotide and/or amino acid sequences have properties
known to biologists. Ex.: ATG is a string which must be present at the
beginning of every protein (gene) in a DNA sequence.
Determining whether a DNA sequence contains a specific (candidate) primer is
therefore essential for running PCR correctly.
A conserved DNA sequence is a sequence of nucleotides that is found
in the DNA of multiple species and/or multiple strains.
Some sequences are conserved exactly. However, many sequences are
conserved with some modifications. Finding such modified strings is an
important step in mapping the DNA of a new organism.
Intro.
Needle in a haystack
The string matching problem consists of
finding a (usually short) string, the
pattern, as a substring in a given
(usually very long) string, the text. [1]
1/56
Formal Definition
Let Σ be an arbitrary alphabet.
The (exact) string matching problem is the following problem:
Input: Two strings t = t_1 … t_n and p = p_1 … p_m over Σ.
Output: The set of all positions in the text t where an occurrence of
the pattern p as a substring starts [1].
2/56
Classification
using preprocessing as the main criterion
Classes of string searching algorithms [2]
                            Text not preprocessed        Text preprocessed
Patterns not preprocessed   Primitive algorithms         Index methods
Patterns preprocessed       Constructed search engines   Signature methods
3/56
Basic classification
Single Pattern Algorithms
✓ Naïve String Search
✓ Knuth-Morris-Pratt Algorithm
✓ Boyer-Moore Algorithm
✓ Rabin-Karp String Search Algorithm
✓ Finite State Automaton Based Search
Bitap algorithm (shift-or, shift-and, Baeza–Yates–Gonnet)
Two-way string-matching algorithm
BNDM (Backward Non-Deterministic Dawg Matching)
BOM (Backward Oracle Matching)
4/56
Basic classification
Algorithms using a finite set of patterns
Aho–Corasick string matching algorithm (extension of Knuth-Morris-Pratt)
Commentz-Walter algorithm (extension of Boyer-Moore)
Set-BOM (extension of Backward Oracle Matching)
Rabin–Karp string search algorithm
5/56
Basic classification
Algorithms using an infinite number of patterns
Naturally, the patterns cannot be enumerated finitely in this
case. They are usually represented by a regular grammar or a
regular expression.
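As a minimal sketch of this idea, Python's standard `re` module can search a text for a pattern set given as a regular expression. The expression `A+T` (one or more A followed by a T) and the sample text are illustrative assumptions, not taken from the slides; the expression denotes the infinite set {AT, AAT, AAAT, ...}.

```python
import re

def regex_occurrences(regex, text):
    """Return (start index, matched substring) for every non-overlapping match."""
    return [(m.start(), m.group()) for m in re.finditer(regex, text)]

# Hypothetical example input: one regular expression stands for infinitely many patterns.
print(regex_occurrences(r"A+T", "TAATCGATAAAT"))
```

Start indices are 0-based, and `re.finditer` reports non-overlapping matches only.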
6/56
Naïve string search
Input: a pattern p = p_1 … p_m and a text t = t_1 … t_n
I := ∅
for j := 0 to n-m do
  i := 1
  while i <= m and p_i = t_{j+i} do
    i := i+1
  if i = m+1 then {p_1 … p_m = t_{j+1} … t_{j+m}}
    I := I ∪ {j+1}
Output: The set I of positions
where an occurrence of p as a substring in t starts
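The naive search above can be sketched in Python as follows. The function name `naive_search` is an assumption made for illustration; positions are reported 1-based to match the set I in the pseudocode.

```python
def naive_search(p, t):
    """Return the 1-based start positions of all occurrences of p in t."""
    n, m = len(t), len(p)
    positions = []
    for j in range(n - m + 1):              # j is the 0-based shift of the pattern
        i = 0
        # Check the bound first, then compare p[i] with the text under the pattern.
        while i < m and p[i] == t[j + i]:
            i += 1
        if i == m:                          # every pattern character matched
            positions.append(j + 1)
    return positions

print(naive_search("AAT", "TAGACAATCG"))
```

The worst case takes O(n·m) character comparisons, since each of the n − m + 1 shifts may compare up to m characters.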
7/56
Knuth–Morris–Pratt algorithm
KMP-Prefix(P)
Begin
  m ← |P|
  T[1] ← 0
  i ← 0
  for j = 2 upto m step 1 do
    while i > 0 and P[i+1] != P[j] do
      i ← T[i]
    Wend
    if P[i+1] = P[j] then
      i ← i+1
    end if
    T[j] ← i
  end for
  return T
end
8/56
Knuth–Morris–Pratt algorithm
KMP-Matcher(T, P)
Begin
  n ← |T|
  m ← |P|
  Table ← KMP-Prefix(P)
  i ← 0
  for j = 1 upto n step 1 do
    while i > 0 and P[i+1] != T[j] do
      i ← Table[i]
    Wend
    if P[i+1] = T[j] then
      i ← i+1
    end if
    if i = m then
      output(j-m)   {0-based shift of the occurrence}
      i ← Table[i]
    end if
  end for
end
9/56
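The two KMP routines can be sketched in Python as below. This is a 0-indexed transcription, so `table[i - 1]` plays the role of `Table[i]` in the 1-indexed pseudocode; the function names are assumptions chosen for illustration.

```python
def kmp_prefix(p):
    """table[j] = length of the longest proper prefix of p[:j+1] that is also its suffix."""
    table = [0] * len(p)
    i = 0                                   # length of the currently matched prefix
    for j in range(1, len(p)):
        while i > 0 and p[i] != p[j]:
            i = table[i - 1]                # fall back to a shorter border
        if p[i] == p[j]:
            i += 1
        table[j] = i
    return table

def kmp_search(t, p):
    """Return the 0-based shifts at which p occurs in t."""
    if not p:
        return []
    table = kmp_prefix(p)
    shifts = []
    i = 0                                   # number of pattern characters matched so far
    for j, c in enumerate(t):
        while i > 0 and p[i] != c:
            i = table[i - 1]
        if p[i] == c:
            i += 1
        if i == len(p):                     # full match ending at text index j
            shifts.append(j - len(p) + 1)
            i = table[i - 1]                # keep going to find overlapping matches
    return shifts

print(kmp_prefix("ababaca"))
print(kmp_search("abababab", "abab"))
```

The text index `j` never moves backwards, which is what gives the O(n + m) running time.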
The Boyer-Moore algorithm
The Boyer-Moore algorithm searches for occurrences of P in T by performing
explicit character comparisons at different alignments.
Instead of a brute-force search of all alignments (of which there are n − m + 1,
where n = |T| and m = |P|), Boyer-Moore uses information gained by
preprocessing P to skip as many alignments as possible. [3]
10/56
The Bad Character Rule
The bad-character rule considers the character in T at which the comparison
process failed. The next occurrence of that character to the left in P is found,
and a shift that brings that occurrence in line with the mismatched occurrence
in T is proposed. [3]
The Good Suffix Rule
• If we match some characters, use knowledge of the matched characters to skip
alignments. [4]
11/56
Ex. 1: the bad character rule (figure from [4], not reproduced)
12/56
Preprocessing for the bad character rule
Input: a pattern p = p_1 … p_m over alphabet Σ
For all a ∈ Σ do β(a) := 0
For i := 1 to m do β(p_i) := i
Output: the function β.
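A minimal Python sketch of this preprocessing, together with a search that uses only the bad-character rule, is given below. It is a simplification under stated assumptions: β stores the rightmost position of each character in the whole pattern, so the shift is max(1, i − β(c)); a full Boyer-Moore implementation would also apply the good suffix rule. The function names are hypothetical.

```python
def bad_char_table(p):
    """beta[a] = 1-based index of the rightmost occurrence of a in p (absent means 0)."""
    beta = {}
    for i, ch in enumerate(p, start=1):
        beta[ch] = i                        # later occurrences overwrite earlier ones
    return beta

def bad_char_search(t, p):
    """Return 0-based shifts of p in t, skipping alignments with the bad-character rule only."""
    n, m = len(t), len(p)
    if m == 0:
        return []
    beta = bad_char_table(p)
    shifts = []
    j = 0                                   # current 0-based alignment of p against t
    while j <= n - m:
        i = m                               # 1-based position in p, scanned right to left
        while i > 0 and p[i - 1] == t[j + i - 1]:
            i -= 1
        if i == 0:
            shifts.append(j)
            j += 1
        else:
            bad = t[j + i - 1]              # text character that caused the mismatch
            j += max(1, i - beta.get(bad, 0))
    return shifts

print(bad_char_table("AAT"))
print(bad_char_search("TAGACAATCG", "AAT"))
```

Characters that never occur in the pattern get β = 0 through `dict.get`, so the table does not have to enumerate the whole alphabet.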
13/56
Good suffix rule
Let t be the substring of T that matched a suffix of P. Skip
alignments until
(a) t matches the opposite characters in P, or
(b) a prefix of P matches a suffix of t, or
(c) P moves past t,
whichever happens first.
14/56
Bad match rule & good suffix rule
15/56
( https://www.youtube.com/watch?v=4Xyhb72LCX4 )
Rabin-Karp – the idea
Compare the hash values of strings rather than the strings themselves.
For efficiency, the hash value of the next position in the text is computed
cheaply from the hash value of the current position. [5]
16/56
Example
Pattern = AAT
Text = TAACGGCATACAATCG
Character values: A=1, T=2, C=3, G=4
Prime number = 7
Calculate the new hash from oldHash:
1. X = oldHash – val(old char)
2. X = X / prime
3. newHash = X + prime^(m-1) * val(new char)
17/56
Example, Rabin-Karp algorithm
Pattern = AAT
H(AAT)= 1 + 1*7 + 2*49 = 106
▪ Text = TAGACAATCG H(TAG) = 2 + 1*7 + 4*49 = 205 != 106
▪ Text = TAGACAATCG H(AGA) = (205-2)/7 + 1*49 = 78 != 106
▪ Text = TAGACAATCG H(GAC) = (78-1)/7 + 3*49 = 158 != 106
▪ Text = TAGACAATCG H(ACA) = (158-4)/7 + 1*49 = 71 != 106
▪ Text = TAGACAATCG H(CAA) = (71-1)/7 + 1*49 = 59 != 106
✓ Text = TAGACAATCG H(AAT) = (59-3)/7 + 2*49 = 106 == 106
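The rolling hash from this example can be sketched in Python as follows. It keeps the slide's scheme (character values A=1, T=2, C=3, G=4, prime 7, first character weighted by 7^0) and, as an assumption for simplicity, uses exact integers without a modulus; practical implementations usually reduce the hash modulo a large prime. The function names are hypothetical.

```python
VAL = {"A": 1, "T": 2, "C": 3, "G": 4}      # character values from the example
PRIME = 7

def slide_hash(s):
    """H(s) = sum of val(s[k]) * PRIME**k, i.e. the first character has weight 1."""
    return sum(VAL[c] * PRIME**k for k, c in enumerate(s))

def rabin_karp(t, p):
    """Return the 0-based shifts at which p occurs in t."""
    n, m = len(t), len(p)
    if m == 0 or m > n:
        return []
    target = slide_hash(p)
    h = slide_hash(t[:m])                   # hash of the first window
    shifts = []
    for j in range(n - m + 1):
        # Equal hashes only suggest a match, so confirm by comparing the strings.
        if h == target and t[j:j + m] == p:
            shifts.append(j)
        if j + m < n:
            # Drop the outgoing character, divide by PRIME, add the incoming character.
            h = (h - VAL[t[j]]) // PRIME + VAL[t[j + m]] * PRIME**(m - 1)
    return shifts

print(slide_hash("AAT"))
print(rabin_karp("TAGACAATCG", "AAT"))
```

The integer division is exact here because, after subtracting the outgoing character's value, every remaining term is a multiple of the prime.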
18/56
Finite state automaton
We will show that, after a clever preprocessing of the
pattern, one scan of the text from left to right suffices to
solve the string matching problem. Furthermore, we will see
that the preprocessing can also be realized efficiently: it is
possible in O(|p| · |Σ|) time. [1]
19/56
Informal definition of automata
Informally speaking, a finite automaton can be described as a machine that
reads a given text once from left to right. At each step, the automaton is in
one of finitely many internal states, and this internal state can change after
reading every single symbol of the text, depending only on the current state
and the last symbol read.
20/56
Formal definition
A finite automaton is a quintuple M = (Q, Σ, q_0, δ, F), where
Q is a finite set of states,
Σ is an input alphabet,
q_0 ∈ Q is the initial state,
F ⊆ Q is a set of accepting states, and
δ : Q × Σ → Q is a transition function describing the transitions
of the automaton from one state to another.
21/56
Why use a finite state machine?
Complex pattern matching, such as regular expressions that
describe infinitely many patterns:
Finite State Machine (FSM), a.k.a. DFA
Time complexity:
Preprocessing: O(m^3 |Σ|) (straightforward construction)
Matching: Θ(n)
22/56
String matching with FSM
23/56
( https://www.youtube.com/watch?v=nNb9lu5Hvio )
FSM Matching algorithm
FINITE-AUTOMATON-MATCHER(T, δ, m)
1. n ← length[T]
2. q ← 0
3. for i ← 1 to n
4.   do q ← δ(q, T[i])
5.     if q = m then
6.       print "Pattern occurs with shift" i-m
24/56
Transition-function construction algorithm
1. m ← length[P]
2. for q ← 0 to m (for each state)
3.   do for each character a ∈ Σ (|Σ|)
4.     do k ← min(m+1, q+2)
5.       repeat k ← k-1 (1 ≤ k ≤ m+1)
6.       until P_k ⊐ P_q a (Σ k)
7.       δ(q,a) ← k
8. return δ
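The construction and the matcher can be sketched together in Python as below. This is the straightforward O(m^3 |Σ|) construction; the function names and the dictionary representation of the transition function are assumptions made for illustration, and the alphabet is passed in explicitly.

```python
def compute_transition_function(p, alphabet):
    """delta[(q, a)] = length of the longest prefix of p that is a suffix of p[:q] + a."""
    m = len(p)
    delta = {}
    for q in range(m + 1):                  # one state per matched prefix length 0..m
        for a in alphabet:
            k = min(m, q + 1)               # the new state cannot exceed q + 1 or m
            while k > 0 and not (p[:q] + a).endswith(p[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta

def automaton_search(t, p, alphabet):
    """Scan t once from left to right and return the 0-based shifts of p."""
    delta = compute_transition_function(p, alphabet)
    m = len(p)
    q = 0                                   # current state = number of characters matched
    shifts = []
    for i, c in enumerate(t, start=1):      # i is the 1-based text position
        q = delta[(q, c)]
        if q == m:
            shifts.append(i - m)
    return shifts

print(automaton_search("TAGACAATCG", "AAT", "ACGT"))
```

A text character outside `alphabet` raises a `KeyError` in this sketch, so the alphabet must cover every symbol of the text.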
25/56
Better solution: suffix trees
Can solve the problem in O(m) time (m = pattern length) once the suffix tree of the text is built
• Conceptually related to keyword trees [7]
26/56
(Slides 27/56 to 51/56: figures from [8], not reproduced.)
Weiner’s Algorithm I
Definitions
W_i: suffix tree for S_i = S[i..n]$
WHead(i): longest prefix of S_i that is also a prefix of some S_j with j > i
Procedure
Build W_{n+1} = edge (root, n+1) labelled $
For j from n down to 1 do
  Find WHead(j) in W_{j+1}
  w = node labelled WHead(j) (newly created if necessary)
  Create new leaf j and edge (w, j) labelled
  S[j..n] - WHead(j) (the rest of the suffix after WHead(j))
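To make the definitions concrete, the sketch below inserts the suffixes from shortest to longest, in the same order as the loop above, and records WHead(j) for each j. It is explicitly not Weiner's linear-time algorithm: it builds an uncompressed suffix trie of nested dictionaries by walking from the root, which takes O(n^2) time and space. The function name and the trie representation are assumptions made for illustration.

```python
def suffix_trie_heads(s):
    """Insert the suffixes of s + '$' from shortest to longest.

    Returns (root, heads), where heads[j] is WHead(j) for the 1-based position j.
    """
    s = s + "$"
    n = len(s) - 1                          # length of the original string
    root = {"$": {}}                        # initial tree: only the suffix "$" (position n+1)
    heads = {}
    for j in range(n, 0, -1):
        suffix = s[j - 1:]                  # S_j in the 1-based notation of the slide
        node, depth = root, 0
        # Walk down as far as the suffixes inserted so far allow: this path is WHead(j).
        while depth < len(suffix) and suffix[depth] in node:
            node = node[suffix[depth]]
            depth += 1
        heads[j] = suffix[:depth]
        # Append the remainder of the suffix as a new branch ending in a leaf.
        for ch in suffix[depth:]:
            node[ch] = {}
            node = node[ch]
    return root, heads

root, heads = suffix_trie_heads("abab")
print(heads)
```

Because of the terminator $, no suffix is a prefix of another, so every insertion creates at least one new node.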
52/56
[7]
53/56
54/56
[9]
Ukkonen’s suffix tree
(https://www.youtube.com/watch?v=WbLKFzqvacg )
55/56
Suffix array
In computer science, a suffix array is a sorted array of all suffixes of a string.
It is a data structure used, among others, in full-text indices, data
compression algorithms, and within the field of bioinformatics.
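A minimal Python sketch of the definition is given below: the array stores the start indices of the suffixes in lexicographic order. Sorting the suffixes directly is simple but costs O(n^2 log n) in the worst case; faster constructions exist and are not shown. The function name and the example string are assumptions made for illustration.

```python
def build_suffix_array(s):
    """Return the 0-based start indices of the suffixes of s in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

text = "banana$"                            # '$' sorts before the letters in ASCII
sa = build_suffix_array(text)
for i in sa:
    print(i, text[i:])
```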
P.S. 1
Suffix array, example
P.S. 2
Suffix array, example (continue)
P.S. 3
Suffix array – pattern matching
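def search(P, text, A):
    """Return (s, r) such that the suffixes A[s:r] of text start with P."""
    n = len(A)
    # Lower bound: first suffix that is >= P.
    l = 0; r = n
    while l < r:
        mid = (l + r) // 2
        if P > text[A[mid]:]:
            l = mid + 1
        else:
            r = mid
    s = l; r = n
    # Upper bound: first suffix from s onward that does not start with P.
    while l < r:
        mid = (l + r) // 2
        if text[A[mid]:].startswith(P):
            l = mid + 1
        else:
            r = mid
    return (s, r)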
P.S. 4
References
[1]: Hans-Joachim Böckenhauer, Dirk Bongartz, “Algorithmic Aspects of Bioinformatics”,
Natural Computing Series, Springer, 2007, ISSN 1619-7127
[2]: https://en.wikipedia.org/wiki/String_searching_algorithm
[3]: https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
[4]: http://www.cs.jhu.edu/~langmea/resources/lecture_notes/boyer_moore.pdf
[5]: http://u.cs.biu.ac.il/~rosenfa5/Alg2/fingerpainting.ppt
[6]: http://web.cs.mun.ca/~wang/courses/cs6783-13f/n2-string-1.pdf
[7]: http://www.zbh.uni-hamburg.de/pubs/pdf/GieKur1997.pdf
[8]: http://bix.ucsd.edu/bioalgorithms/presentations/Ch09_CombinatorialPatternMatching.pdf
[9]: http://wwwmayr.in.tum.de/konferenzen/Jass03/presentations/pentenrieder.pdf
56/56