CSU22012: Data Structures and Algorithms II
Lecture 7: Substrings
Dr Anthony Ventresque
Outline of Substring search algorithms

• Brute force
• KMP (Knuth-Morris-Pratt)
• Boyer-Moore
• Rabin-Karp
• Many many many others

• Suffix arrays
• LCP (longest common prefix) arrays

Trinity College Dublin, The University of Dublin


Java String implementation

• Which algorithm does String.indexOf(String) use?
• A naïve loop (brute force)
• Why?
• String.contains() is implemented on top of indexOf()

Common interview questions

• Implement a needle-in-a-haystack search
• public int Search(String haystack, String needle)
• Implement strstr()
• Find the first instance of a string in another string
• Longest common substring between 2 files
• Longest substring that’s a palindrome
• Longest repeated substring
• Etc.

Different to Pattern Matching

• Pattern matching: find any one of a specified set of substrings in a text
• Regular expression: a notation for specifying a set of strings
• For more information, see Section 5.4 of Sedgewick and Wayne
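
The distinction can be sketched in Java as follows. This is only an illustration: the class name, method names and sample regular expression are assumptions, not taken from the slides. Substring search looks for one fixed string, while pattern matching looks for any member of a set of strings described by a regular expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: contrasts substring search with pattern matching.
final class SubstringVsPattern {

    // Substring search: index of the first occurrence of one fixed string, or -1.
    static int substringIndex(String text, String needle) {
        return text.indexOf(needle);
    }

    // Pattern matching: index of the first occurrence of any string in the
    // set described by the regular expression, or -1.
    static int patternIndex(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String text = "BCBAABACAABABACAA";
        System.out.println(substringIndex(text, "ABABAC"));
        // A(B|C)AC describes the set {ABAC, ACAC}.
        System.out.println(patternIndex(text, "A(B|C)AC"));
    }
}
```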

Substring search - definition

Substring search – brute force
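
A minimal sketch of the brute force approach, assuming the common convention of returning -1 when there is no match. The class name and the sample strings are illustrative and not taken from the slides.

```java
// Brute-force substring search: try every alignment of the needle in the haystack.
final class BruteForceSearch {

    // Returns the index of the first occurrence of needle in haystack, or -1.
    static int search(String haystack, String needle) {
        int n = haystack.length();
        int m = needle.length();
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            // Compare the needle against the haystack starting at offset i.
            while (j < m && haystack.charAt(i + j) == needle.charAt(j)) {
                j++;
            }
            if (j == m) {
                return i; // all m characters matched at offset i
            }
        }
        return -1; // no alignment matched
    }

    public static void main(String[] args) {
        System.out.println(search("ABACADABRAC", "ABRA"));
        System.out.println(search("ABACADABRAC", "XYZ"));
    }
}
```

In the worst case the inner loop runs close to m times for each of the n - m + 1 offsets, which is the cost the later algorithms try to avoid.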

Substring search – backup

Substring search – explicit backup
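
A sketch of the same brute force search written with a single loop and an explicit backup of the text index, in the style used by Sedgewick and Wayne. The class name and the -1 return convention are assumptions made for this example.

```java
// Brute-force search with one text index i and one pattern index j.
// On a mismatch, i is explicitly backed up to the start of the current attempt.
final class ExplicitBackupSearch {

    // Returns the index of the first occurrence of pat in txt, or -1.
    static int search(String txt, String pat) {
        int n = txt.length();
        int m = pat.length();
        int i;
        int j;
        for (i = 0, j = 0; i < n && j < m; i++) {
            if (txt.charAt(i) == pat.charAt(j)) {
                j++;        // one more pattern character matched
            } else {
                i -= j;     // explicit backup: return to the start of this attempt
                j = 0;      // and restart the pattern
            }
        }
        return (j == m) ? i - m : -1;
    }

    public static void main(String[] args) {
        System.out.println(search("ABACADABRAC", "ABRA"));
    }
}
```

The line `i -= j` is the backup that KMP is designed to eliminate.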

Knuth-Morris-Pratt (KMP)

KMP

• Conceived in 1970 by Donald Knuth and Vaughan Pratt
• Discovered independently by James H. Morris
• Published jointly by all three in 1977
• (Donald Knuth is the author of The Art of Computer Programming, a comprehensive monograph that covers many kinds of programming algorithms and their analysis: 4 volumes and counting)

KMP – how to avoid backup? Use a DFA

• DFA – Deterministic Finite State Automaton
• Finite State Automaton/Finite State Machine
• a mathematical model of computation
• an abstract machine that can be in exactly one of a finite number of states at any given time
• can change from one state to another in response to external inputs
• the change from one state to another is called a transition
• defined by a list of its states, its initial state, and the conditions for each transition
• Deterministic: produces a unique computation (or run) of the automaton for each input string
• DFA: a finite-state machine that accepts or rejects strings of symbols

FSA – an example

Finite State Machine – more formally

• Deterministic finite automata are always complete: they define a transition for each state and each input symbol.
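
In the usual textbook notation (an assumption here, since the formula on the slide itself did not survive extraction), a DFA is written as a 5-tuple, and completeness means that the transition function is total:

```latex
M = (Q, \Sigma, \delta, q_0, F), \qquad \delta : Q \times \Sigma \to Q, \qquad q_0 \in Q, \qquad F \subseteq Q
```

Here $Q$ is the finite set of states, $\Sigma$ the input alphabet, $\delta$ the transition function, $q_0$ the initial state and $F$ the set of accepting states.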

DFA

Coming up:

• DFA simulation (assuming the DFA is given)
• DFA construction (manual)
• DFA construction (algorithm/code)

DFA Simulation

DFA simulation

Text: B C B A A B A C A A B A B A C A A

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  1  3  1  5  1
              B   0  2  0  4  0  4
              C   0  0  0  0  0  6

[Figure: state diagram of the DFA for A B A B A C, states 0 to 6, with the transitions listed in the table]
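
The simulation can be sketched in Java with the table hard-coded, as if the DFA were given. The class name is illustrative, and the code assumes the text contains only the characters A, B and C so that `c - 'A'` selects the table row.

```java
// Simulates the given DFA for the pattern ABABAC over the alphabet {A, B, C}.
final class DfaSimulation {

    // DFA[c][j] is the next state from state j on character c.
    // Row 0 is 'A', row 1 is 'B', row 2 is 'C' (values copied from the table).
    static final int[][] DFA = {
        {1, 1, 3, 1, 5, 1},
        {0, 2, 0, 4, 0, 4},
        {0, 0, 0, 0, 0, 6},
    };
    static final int ACCEPT = 6; // state reached after matching all 6 characters

    // Returns the start index of the first match in txt, or -1.
    // Assumes txt contains only 'A', 'B' and 'C'.
    static int search(String txt) {
        int j = 0; // current state = number of pattern characters matched
        for (int i = 0; i < txt.length(); i++) {
            j = DFA[txt.charAt(i) - 'A'][j];
            if (j == ACCEPT) {
                return i - ACCEPT + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("BCBAABACAABABACAA"));
    }
}
```

Note that the text index i only ever moves forward: every text character is read exactly once.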

DFA States – number of characters matched

DFA simulation exercise

• Consider the following DFA for searching for the string “IVANA”
• For simplicity, we assume the alphabet contains only the letters A, I, N, V
• The DFA is therefore as follows

j                 0  1  2  3  4
pat.charAt(j)     I  V  A  N  A
dfa[][j]      A   0  0  3  0  5
              I   1  1  1  1  1
              N   0  0  0  4  0
              V   0  2  0  0  0

Exercise 1 Simulating DFA:

1. Construct a graphical representation of the DFA table
2. Write out the trace of states when searching for the string “IVANA” in the input “ANVAIVAAIVANAAN”

j                 0  1  2  3  4
pat.charAt(j)     I  V  A  N  A
dfa[][j]      A   0  0  3  0  5
              I   1  1  1  1  1
              N   0  0  0  4  0
              V   0  2  0  0  0

Exercise 1 Simulating DFA:

j                 0  1  2  3  4
pat.charAt(j)     I  V  A  N  A
dfa[][j]      A   0  0  3  0  5
              I   1  1  1  1  1
              N   0  0  0  4  0
              V   0  2  0  0  0

Text: ANVAIVAAIVANAAN
Search string: IVANA
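
One way to check a hand-written trace for this exercise is to simulate the table in code. The sketch below is an illustration only: the class name and the row order "AINV" are assumptions, and the table values follow the match and mismatch rules for the pattern IVANA.

```java
import java.util.ArrayList;
import java.util.List;

// Records the sequence of DFA states visited while scanning a text for IVANA.
final class IvanaTrace {

    static final String ALPHABET = "AINV"; // row order of the table below
    static final int[][] DFA = {
        {0, 0, 3, 0, 5}, // A
        {1, 1, 1, 1, 1}, // I
        {0, 0, 0, 4, 0}, // N
        {0, 2, 0, 0, 0}, // V
    };
    static final int ACCEPT = 5;

    // Returns the start state followed by the state after each character read,
    // stopping as soon as the accept state is reached.
    // Assumes txt contains only the letters A, I, N and V.
    static List<Integer> trace(String txt) {
        List<Integer> states = new ArrayList<>();
        int j = 0;
        states.add(j);
        for (int i = 0; i < txt.length() && j < ACCEPT; i++) {
            j = DFA[ALPHABET.indexOf(txt.charAt(i))][j];
            states.add(j);
        }
        return states;
    }

    public static void main(String[] args) {
        System.out.println(trace("ANVAIVAAIVANAAN"));
    }
}
```

For the exercise text, the accept state is reached after reading the character at text index 12, so the match starts at index 8.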
KMP Java Implementation

DFA Construction

Knuth-Morris-Pratt construction

Include one state for each character in the pattern (plus an accept state).

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   .  .  .  .  .  .
              B   .  .  .  .  .  .
              C   .  .  .  .  .  .
(. marks an entry not yet filled in)

States: 0 1 2 3 4 5 6

Constructing the DFA for KMP substring search for A B A B A C


Knuth-Morris-Pratt construction

Match transition: advance to the next state if c == pat.charAt(j).

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  .  3  .  5  .
              B   .  2  .  4  .  .
              C   .  .  .  .  .  6
(. marks an entry not yet filled in)

[Figure: states 0 to 6 joined by the match transitions 0 -A-> 1 -B-> 2 -A-> 3 -B-> 4 -A-> 5 -C-> 6]

Constructing the DFA for KMP substring search for A B A B A C


Knuth-Morris-Pratt construction

Mismatch transition: back up if c != pat.charAt(j).

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  .  3  .  5  .
              B   0  2  .  4  .  .
              C   0  .  .  .  .  6
(. marks an entry not yet filled in)

[Figure: DFA state diagram with the transitions filled in so far; marker j at state 0, which loops back to itself on B and C]

Constructing the DFA for KMP substring search for A B A B A C


Knuth-Morris-Pratt construction

Mismatch transition: back up if c != pat.charAt(j).

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  1  3  .  5  .
              B   0  2  .  4  .  .
              C   0  0  .  .  .  6
(. marks an entry not yet filled in)

[Figure: DFA state diagram with the transitions filled in so far; marker j at state 1, which loops back to itself on A and returns to state 0 on C]

Constructing the DFA for KMP substring search for A B A B A C


Knuth-Morris-Pratt construction

Mismatch transition: back up if c != pat.charAt(j).

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  1  3  .  5  .
              B   0  2  0  4  .  .
              C   0  0  0  .  .  6
(. marks an entry not yet filled in)

[Figure: DFA state diagram with the transitions filled in so far; marker j at state 2, which returns to state 0 on B and C]

Constructing the DFA for KMP substring search for A B A B A C


Knuth-Morris-Pratt construction

Mismatch transition: back up if c != pat.charAt(j).

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  1  3  1  5  .
              B   0  2  0  4  .  .
              C   0  0  0  0  .  6
(. marks an entry not yet filled in)

[Figure: DFA state diagram with the transitions filled in so far; marker j at state 3 and marker X at state 1; state 3 goes to state 1 on A and to state 0 on C]

Constructing the DFA for KMP substring search for A B A B A C


Knuth-Morris-Pratt construction

Mismatch transition: back up if c != pat.charAt(j).

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  1  3  1  5  .
              B   0  2  0  4  0  .
              C   0  0  0  0  0  6
(. marks an entry not yet filled in)

[Figure: DFA state diagram with the transitions filled in so far; marker j at state 4 and marker X at state 2; state 4 returns to state 0 on B and C]


Knuth-Morris-Pratt construction

Mismatch transition: back up if c != pat.charAt(j).

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  1  3  1  5  1
              B   0  2  0  4  0  4
              C   0  0  0  0  0  6

[Figure: DFA state diagram with all transitions filled in; marker j at state 5 and marker X at state 3; state 5 goes to state 1 on A and to state 4 on B]


Knuth-Morris-Pratt construction

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  1  3  1  5  1
              B   0  2  0  4  0  4
              C   0  0  0  0  0  6

[Figure: completed DFA state diagram for A B A B A C, states 0 to 6]


Exercise 2 constructing DFA:

• Construct the DFA table and graphical representation for the search word “banana”
• Make up a 15-letter string in which to search for the word, assuming the alphabet contains only letters
• You can decide whether the string contains the search word or not, but if it does, do not place it too early in the string
• Write out the trace of DFA states while searching for the word in the made-up string
DFA construction – Java code

Include one state for each character in the pattern (plus an accept state).

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   .  .  .  .  .  .
              B   .  .  .  .  .  .
              C   .  .  .  .  .  .
(. marks an entry not yet filled in)

States: 0 1 2 3 4 5 6

Constructing the DFA for KMP substring search for A B A B A C


DFA Construction – Java code

DFA construction – Java code

Match transition. For each state j, dfa[pat.charAt(j)][j] = j+1.

Before the transition, the first j characters of the pattern have already been matched; after it, the first j+1 characters of the pattern have been matched.

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  .  3  .  5  .
              B   .  2  .  4  .  .
              C   .  .  .  .  .  6
(. marks an entry not yet filled in)

[Figure: states 0 to 6 joined by the match transitions 0 -A-> 1 -B-> 2 -A-> 3 -B-> 4 -A-> 5 -C-> 6]

Constructing the DFA for KMP substring search for A B A B A C


DFA Construction – Java code

DFA construction

So let's do this again, while maintaining state X

Knuth-Morris-Pratt construction

Include one state for each character in the pattern (plus an accept state).

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   .  .  .  .  .  .
              B   .  .  .  .  .  .
              C   .  .  .  .  .  .
(. marks an entry not yet filled in)

States: 0 1 2 3 4 5 6

Constructing the DFA for KMP substring search for A B A B A C


Knuth-Morris-Pratt construction

Match transition. For each state j, dfa[pat.charAt(j)][j] = j+1.

Before the transition, the first j characters of the pattern have already been matched; after it, the first j+1 characters of the pattern have been matched.

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  .  3  .  5  .
              B   .  2  .  4  .  .
              C   .  .  .  .  .  6
(. marks an entry not yet filled in)

[Figure: states 0 to 6 joined by the match transitions 0 -A-> 1 -B-> 2 -A-> 3 -B-> 4 -A-> 5 -C-> 6]

Constructing the DFA for KMP substring search for A B A B A C


Knuth-Morris-Pratt construction

Mismatch transition.

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  .  3  .  5  .
              B   0  2  .  4  .  .
              C   0  .  .  .  .  6
(. marks an entry not yet filled in)

[Figure: DFA state diagram with the transitions filled in so far; marker j at state 0, which loops back to itself on B and C]

Constructing the DFA for KMP substring search for A B A B A C


Knuth-Morris-Pratt construction

Mismatch transition. For each state j and char c != pat.charAt(j), dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  1  3  .  5  .
              B   0  2  .  4  .  .
              C   0  0  .  .  .  6
(. marks an entry not yet filled in)

[Figure: DFA state diagram with the transitions filled in so far; marker j at state 1, which loops back to itself on A and returns to state 0 on C]

Constructing the DFA for KMP substring search for A B A B A C


Knuth-Morris-Pratt construction

Mismatch transition. For each state j and char c != pat.charAt(j), dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  1  3  .  5  .
              B   0  2  0  4  .  .
              C   0  0  0  .  .  6
(. marks an entry not yet filled in)

[Figure: DFA state diagram with the transitions filled in so far; marker j at state 2, which returns to state 0 on B and C]

Constructing the DFA for KMP substring search for A B A B A C


Knuth-Morris-Pratt construction

Mismatch transition. For each state j and char c != pat.charAt(j), dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  1  3  1  5  .
              B   0  2  0  4  .  .
              C   0  0  0  0  .  6
(. marks an entry not yet filled in)

[Figure: DFA state diagram with the transitions filled in so far; marker j at state 3 and marker X at state 1; state 3 goes to state 1 on A and to state 0 on C]

Constructing the DFA for KMP substring search for A B A B A C


Knuth-Morris-Pratt construction

Mismatch transition. For each state j and char c != pat.charAt(j), dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  1  3  1  5  .
              B   0  2  0  4  0  .
              C   0  0  0  0  0  6
(. marks an entry not yet filled in)

[Figure: DFA state diagram with the transitions filled in so far; marker j at state 4 and marker X at state 2; state 4 returns to state 0 on B and C]


Knuth-Morris-Pratt construction

Mismatch transition. For each state j and char c != pat.charAt(j), dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  1  3  1  5  1
              B   0  2  0  4  0  4
              C   0  0  0  0  0  6

[Figure: DFA state diagram with all transitions filled in; marker j at state 5 and marker X at state 3; state 5 goes to state 1 on A and to state 4 on B]


Knuth-Morris-Pratt construction

j                 0  1  2  3  4  5
pat.charAt(j)     A  B  A  B  A  C
dfa[][j]      A   1  1  3  1  5  1
              B   0  2  0  4  0  4
              C   0  0  0  0  0  6

[Figure: completed DFA state diagram for A B A B A C, states 0 to 6]
DFA Construction – Java code
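
The code from this slide did not survive extraction. The following is a sketch of the construction in the style of Sedgewick and Wayne, applying the match rule dfa[pat.charAt(j)][j] = j+1 and the mismatch rule dfa[c][j] = dfa[c][X]. The class name, the method signature and the radix parameter are assumptions made for this example, and the pattern is assumed to be non-empty.

```java
import java.util.Arrays;

// Builds the KMP DFA table dfa[c][j] for a pattern.
final class KmpDfa {

    // radix is the alphabet size (for example 256 for extended ASCII).
    // Assumes pat is non-empty and every character code is below radix.
    static int[][] build(String pat, int radix) {
        int m = pat.length();
        int[][] dfa = new int[radix][m];
        dfa[pat.charAt(0)][0] = 1;              // match transition for state 0
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < radix; c++) {
                dfa[c][j] = dfa[c][x];          // mismatch: copy from state X
            }
            dfa[pat.charAt(j)][j] = j + 1;      // match: advance to the next state
            x = dfa[pat.charAt(j)][x];          // update X
        }
        return dfa;
    }

    public static void main(String[] args) {
        int[][] dfa = build("ABABAC", 256);
        System.out.println("A " + Arrays.toString(dfa['A']));
        System.out.println("B " + Arrays.toString(dfa['B']));
        System.out.println("C " + Arrays.toString(dfa['C']));
    }
}
```

All entries of state 0 other than the match are left at 0 by Java's default array initialisation, which is why only the match transition is set before the loop.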

KMP search – Java code
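
The code from this slide did not survive extraction either. A sketch of the full search, again following Sedgewick and Wayne, is shown below; the class name Kmp, the RADIX constant and the -1 return convention are assumptions for this example. The constructor repeats the DFA construction so that the block is self-contained.

```java
// KMP substring search driven by a precomputed DFA.
final class Kmp {

    private static final int RADIX = 256; // assumes extended-ASCII characters
    private final int[][] dfa;
    private final int m;

    // Builds the DFA for a non-empty pattern.
    Kmp(String pat) {
        m = pat.length();
        dfa = new int[RADIX][m];
        dfa[pat.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < RADIX; c++) {
                dfa[c][j] = dfa[c][x];      // mismatch transitions
            }
            dfa[pat.charAt(j)][j] = j + 1;  // match transition
            x = dfa[pat.charAt(j)][x];      // update X
        }
    }

    // Returns the start index of the first match in txt, or -1.
    int search(String txt) {
        int n = txt.length();
        int i;
        int j;
        for (i = 0, j = 0; i < n && j < m; i++) {
            j = dfa[txt.charAt(i)][j];      // no backup: i only moves forward
        }
        return (j == m) ? i - m : -1;
    }

    public static void main(String[] args) {
        System.out.println(new Kmp("ABABAC").search("BCBAABACAABABACAA"));
    }
}
```

Compared with the brute force versions, the search loop has no `i -= j` step, so it can also be used on an input stream that cannot be rewound.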

KMP search performance
