Week 9: String Algorithms, Approximation

Strings
A string is a sequence of characters.
Examples of strings:
• C program
• HTML document
• DNA sequence
• Digitised image
Examples of alphabets:
• ASCII
• Unicode
• {0,1}
• {A,C,G,T}
Notation:
• length(P) … number of characters in P
• λ … empty string (length(λ) = 0)
• Σ^m … set of all strings of length m over alphabet Σ
• Σ* … set of all strings over alphabet Σ
Pattern Matching

Given a text T of length n and a pattern P of length m, find the starting index of a substring of T equal to P.
Example (pattern checked backwards):
• Text … abacaab
• Pattern … abacab
Applications:
• Text editors
• Search engines
• Biological research
NaiveMatching(T,P):
| Input  text T of length n, pattern P of length m
| Output starting index of a substring of T equal to P
|        -1 if no such substring exists
|
| for all i=0..n-m do
| |  j=0                              // check from left to right
| |  while j<m and T[i+j]=P[j] do     // test ith shift of pattern
| |  |  j=j+1
| |  |  if j=m then
| |  |  |  return i                   // entire pattern checked
| |  |  end if
| |  end while
| end for
| return -1                           // no match found
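As a concrete reference, a direct C rendering of the pseudocode (a minimal sketch; the function name and the use of C strings with strlen are my own choices):

#include <string.h>

// Return the starting index of the first substring of t equal to p,
// or -1 if no such substring exists.
int naiveMatch(const char *t, const char *p) {
    int n = strlen(t), m = strlen(p);
    for (int i = 0; i <= n - m; i++) {    // try every shift of the pattern
        int j = 0;
        while (j < m && t[i + j] == p[j]) // test the ith shift, left to right
            j++;
        if (j == m)                       // entire pattern matched
            return i;
    }
    return -1;
}

For example, naiveMatch("abacaab", "acaa") returns 2.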
Analysis of Naive Pattern Matching
Naive pattern matching runs in O(n·m) time; worst-case example:
• T = aaa…ah
• P = aaah
• may occur in DNA sequences
• unlikely in English text
When a mismatch occurs between P[j] and T[i+j], shift the pattern all the way to align P[0] with T[i+j]
Example (P = abcde):

  Text:    abcdabcdeabcc        abcdabcdeabcc
  Pattern: abcdexxxxxxxx   ⇒    xxxxabcde
Boyer-Moore Algorithm
The Boyer-Moore pattern matching algorithm is based on two heuristics (the pattern is compared starting from its back, and after every shift the comparison restarts from the back):
• Looking-glass heuristic: compare P with a subsequence of T moving backwards
• Character-jump heuristic: when a mismatch occurs at T[i]=c
  – if P contains c ⇒ shift P so as to align the last occurrence of c in P with T[i]
  – otherwise ⇒ shift P past the mismatch entirely, aligning P[0] with T[i+1]
• last-occurrence function L
  – L maps Σ to integers such that L(c) is defined as
    · the largest index i such that P[i]=c, or
    · -1 if no such index exists
Example:

  c    | a | b | c | d
  L(c) | 2 | 3 | 1 | -1

• L can be represented by an array indexed by the numeric codes of the characters
• L can be computed in O(m+s) time (m … length of pattern, s … size of Σ)
BoyerMooreMatch(T,P,Σ):
| Input  text T of length n, pattern P of length m, alphabet Σ
| Output starting index of a substring of T equal to P
|        -1 if no such substring exists
|
| L=lastOccurrenceFunction(P,Σ)
| i=m-1, j=m-1                       // start at end of pattern
| repeat
| |  if T[i]=P[j] then
| |  |  if j=0 then
| |  |  |  return i                  // match found at i
| |  |  else
| |  |  |  i=i-1, j=j-1              // keep comparing backwards
| |  |  end if
| |  else                            // character-jump
| |  |  i=i+m-min(j,1+L[T[i]])
| |  |  j=m-1
| |  end if
| until i≥n
| return -1                          // no match
Note: the biggest jump (m characters ahead, i=i+m) occurs when L[T[i]] = -1
Example: last-occurrence function for P = abacab:

  c    | a | b | c | d
  L(c) | 4 | 5 | 3 | -1

In the worked example (figure omitted), Boyer-Moore needs 13 comparisons in total.
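The complete algorithm in C, including the computation of the last-occurrence function (a minimal sketch assuming an ASCII alphabet; names are my own):

#include <string.h>

#define ASCII 128                        // assumed alphabet size

// Return the starting index of a substring of t equal to p, or -1.
int boyerMooreMatch(const char *t, const char *p) {
    int n = strlen(t), m = strlen(p);
    if (m == 0) return 0;                // empty pattern matches at 0

    // last-occurrence function: L[c] = largest i with p[i]==c, else -1
    int L[ASCII];
    for (int c = 0; c < ASCII; c++) L[c] = -1;
    for (int i = 0; i < m; i++) L[(unsigned char)p[i]] = i;

    int i = m - 1, j = m - 1;            // start at end of pattern
    while (i < n) {
        if (t[i] == p[j]) {
            if (j == 0) return i;        // match found at i
            i--; j--;                    // keep comparing backwards
        } else {                         // character-jump heuristic
            int last = L[(unsigned char)t[i]];
            int jump = (j < 1 + last) ? j : 1 + last;  // min(j, 1+L[t[i]])
            i += m - jump;
            j = m - 1;                   // restart from the back of the pattern
        }
    }
    return -1;                           // no match
}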
Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt algorithm compares the pattern with the text from the FRONT (left to right) and, on a mismatch, continues with the current i rather than backing up in the text.
Reminder:
• When a mismatch occurs, what is the most we can shift the pattern to avoid redundant comparisons?
• Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]
KMP preprocesses the pattern P[0..m-1] to find matches of its prefixes with the pattern itself:
• the failure function F(j) is the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
Example: P = abaaba

  j    | 0 | 1 | 2 | 3 | 4 | 5
  P[j] | a | b | a | a | b | a
  F(j) | 0 | 0 | 1 | 1 | 2 | 3
KMPMatch(T,P):
| Input  text T of length n, pattern P of length m
| Output starting index of a substring of T equal to P
|        -1 if no such substring exists
|
| F=failureFunction(P)
| i=0, j=0                           // start from left
| while i<n do
| |  if T[i]=P[j] then
| |  |  if j=m-1 then
| |  |  |  return i-j                // match found at i-j
| |  |  else
| |  |  |  i=i+1, j=j+1              // keep comparing
| |  |  end if
| |  else if j>0 then                // mismatch and j>0?
| |  |  j=F[j-1]                     // → shift pattern to i-F[j-1]
| |  else                            // mismatch and j still 0?
| |  |  i=i+1                        // → begin at next text character
| |  end if
| end while
| return -1                          // no match
Example: failure function for P = abacab:

  j    | 0 | 1 | 2 | 3 | 4 | 5
  P[j] | a | b | a | c | a | b
  F(j) | 0 | 0 | 1 | 0 | 1 | 2

In the worked example (figure omitted), KMP needs 19 comparisons in total.
failureFunction(P):
| Input  pattern P of length m
| Output failure function for P
|
| F[0]=0                             // F[0] is always 0
| j=1, len=0
| while j<m do
| |  if P[j]=P[len] then
| |  |  len=len+1                    // we have matched len+1 characters
| |  |  F[j]=len                     // P[0..len-1] = P[j-len+1..j]
| |  |  j=j+1
| |  else if len>0 then              // mismatch and len>0?
| |  |  len=F[len-1]                 // → reuse already computed F[len-1] as new len
| |  else                            // mismatch and len still 0?
| |  |  F[j]=0                       // → no prefix of P[0..j] is a suffix of P[1..j]
| |  |  j=j+1                        // → continue with next pattern character
| |  end if
| end while
| return F
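Both functions translate directly into C (a minimal sketch; the heap-allocated failure array is my own choice):

#include <stdlib.h>
#include <string.h>

// Compute the failure function of p[0..m-1] into caller-allocated F.
static void failureFunction(const char *p, int m, int *F) {
    F[0] = 0;                            // F[0] is always 0
    int len = 0;
    for (int j = 1; j < m; ) {
        if (p[j] == p[len]) {
            len++;                       // matched len+1 characters
            F[j++] = len;
        } else if (len > 0) {
            len = F[len - 1];            // reuse already computed value
        } else {
            F[j++] = 0;                  // no prefix of p[0..j] is a suffix of p[1..j]
        }
    }
}

// Return the starting index of a substring of t equal to p, or -1.
int kmpMatch(const char *t, const char *p) {
    int n = strlen(t), m = strlen(p);
    if (m == 0) return 0;                // empty pattern matches at 0
    int *F = malloc(m * sizeof(int));
    failureFunction(p, m, F);
    int i = 0, j = 0;                    // start from left
    while (i < n) {
        if (t[i] == p[j]) {
            if (j == m - 1) { free(F); return i - j; }  // match at i-j
            i++; j++;                    // keep comparing
        } else if (j > 0) {
            j = F[j - 1];                // shift pattern; i stays put
        } else {
            i++;                         // begin at next text character
        }
    }
    free(F);
    return -1;                           // no match
}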
Exercise #5: Compute the failure function for P = abaaba (cf. the example above):
⇒ F[0]=0
j=1, len=0, P[1]≠P[0] ⇒ F[1]=0
j=2, len=0, P[2]=P[0] ⇒ len=1, F[2]=1
j=3, len=1, P[3]≠P[1] ⇒ len=F[0]=0
j=3, len=0, P[3]=P[0] ⇒ len=1, F[3]=1
j=4, len=1, P[4]=P[1] ⇒ len=2, F[4]=2
j=5, len=2, P[5]=P[2] ⇒ len=3, F[5]=3
Boyer-Moore vs KMP
Boyer-Moore algorithm
• decides how far to jump ahead based on the mismatched character in the text
• works best on large alphabets and natural-language texts (e.g. English)
Knuth-Morris-Pratt algorithm
• uses information embodied in the pattern itself to determine where the next match could begin
• works best on small alphabets (e.g. {A,C,G,T})
For the keen: the article "Average running time of the Boyer-Moore-Horspool algorithm" shows that the expected running time is inversely proportional to the size of the alphabet
Preprocessing Strings
Preprocessing the pattern speeds up pattern matching queries
• after preprocessing P, the KMP algorithm performs pattern matching in time proportional to the text length
If the text is large, immutable (i.e. it does not change) and searched often (e.g. the works of Shakespeare)
• we can preprocess the text instead of the pattern
A trie is a compact, tree-based data structure for storing a set of strings, supporting fast pattern matching
Note: "trie" comes from retrieval, but is pronounced like "try" to distinguish it from "tree"
Tries
Tries are trees organised using parts of keys (rather than whole keys)
Exercise: how many words are encoded in the trie on the previous slide (figure omitted)? The answer depends on the number of "finishing" (red) nodes; here it is 11.
A common implementation indexes each node's children by character code:

#define ALPHABET_SIZE 26

Note: can also use BST-like nodes for a more space-efficient implementation of tries
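In C, a node of this character-indexed variant could look as follows (a minimal sketch; the finish flag and data pointer mirror the pseudocode operations below):

#include <stdbool.h>

typedef struct TrieNode {
    struct TrieNode *child[ALPHABET_SIZE];  // one slot per character 'a'..'z'
    bool  finish;                           // true if a key ends at this node
    void *data;                             // item stored at a "finishing" node
} TrieNode;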
Trie Operations
Basic operations on tries:
find(trie,key):
| Input  trie, key
| Output pointer to element in trie if key found
|        NULL otherwise
|
| node=trie
| for each char in key do
| |  if node.child[char] exists then
| |  |  node=node.child[char]        // move down one level
| |  else
| |  |  return NULL
| |  end if
| end for
| if node.finish then                // "finishing" node reached?
| |  return node
| else
| |  return NULL
| end if
insert(trie,item,key):
| Input  trie, item with key of length m
| Output trie with item inserted
|
| if trie is empty then
| |  t=new trie node
| else
| |  t=trie
| end if
| if m=0 then
| |  t.finish=true, t.data=item
| else
| |  t.child[key[0]]=insert(t.child[key[0]],item,key[1..m-1])
| end if
| return t
• O(n) space
• insertion and search in O(m) time
  – n … total size of text (e.g. sum of lengths of all strings in a given dictionary)
  – m … size of the string parameter of the operation (the "key")
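The two operations translate directly into C, using the TrieNode struct sketched earlier and assuming keys consist of lower-case letters only:

#include <stdlib.h>

// Return the node holding key, or NULL if key is not in the trie.
TrieNode *find(TrieNode *trie, const char *key) {
    TrieNode *node = trie;
    for ( ; *key != '\0' && node != NULL; key++)
        node = node->child[*key - 'a'];      // move down one level
    return (node != NULL && node->finish) ? node : NULL;
}

// Insert item under key; returns the root of the updated trie.
TrieNode *insert(TrieNode *t, void *item, const char *key) {
    if (t == NULL)
        t = calloc(1, sizeof(TrieNode));     // create missing node on demand
    if (*key == '\0') {                      // end of key: "finishing" node
        t->finish = true;
        t->data = item;
    } else {
        t->child[key[0] - 'a'] =
            insert(t->child[key[0] - 'a'], item, key + 1);
    }
    return t;
}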
Word Matching with Tries
Preprocessing the text:
• insert all searchable words of the text into a trie
• each leaf stores the occurrence(s) of the corresponding word in the text
(figure omitted)
Compressed Tries
Compressed tries are obtained from tries by merging chains of redundant (single-child) internal nodes, labelling each node with a substring rather than a single character
Example: (figure omitted)

Exercise #8: Compressed Tries
How many nodes (including the root) are needed for the compressed trie?
Pattern Matching With Suffix Tries
The suffix trie of a text T is the compressed trie of all the suffixes of T
Example: (figure omitted)
Compact representation: edge labels are stored as pairs (start,end) of indices into T, rather than as explicit substrings
Input: compact suffix trie of a text T and a pattern P of length m
Goal: find the starting index of a substring of T equal to P
suffixTrieMatch(trie,P):
| Input  compact suffix trie for text T, pattern P of length m
| Output starting index of a substring of T equal to P
|        -1 if no such substring exists
|
| j=0, v=root of trie
| repeat
| |  // we have matched j characters
| |  if ∃w∈children(v) such that P[j]=T[start(w)] then
| |  |  i=start(w)                   // start(w) is the start index of w's label
| |  |  x=end(w)-i+1                 // x … length of w's label
| |  |  if m≤x then                  // remaining pattern fits within node label?
| |  |  |  if P[j..j+m-1]=T[i..i+m-1] then
| |  |  |  |  return i-j             // match at i-j
| |  |  |  else
| |  |  |  |  return -1              // no match
| |  |  |  end if
| |  |  else if P[j..j+x-1]=T[i..i+x-1] then
| |  |  |  j=j+x, m=m-x              // update remaining pattern start and length
| |  |  |  v=w                       // move down one level
| |  |  else
| |  |  |  return -1                 // no match
| |  |  end if
| |  else
| |  |  return -1                    // no match
| |  end if
| until v is leaf node
| return -1                          // no match
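A C sketch of the matching phase, mirroring the pseudocode (building the compact suffix trie is not shown; the node layout, with start/end label indices and children indexed by the first character of their label, is an assumption):

#include <string.h>

#define ASCII 128                        // assumed alphabet size

typedef struct STNode {
    int start, end;                      // label = T[start..end]
    struct STNode *child[ASCII];         // indexed by first character of child label
} STNode;

// Return the starting index of a substring of t equal to p, or -1.
int suffixTrieMatch(STNode *root, const char *t, const char *p) {
    int j = 0, m = strlen(p);
    if (m == 0) return 0;                // empty pattern matches at 0
    STNode *v = root;
    for (;;) {                           // until v is a leaf
        STNode *w = v->child[(unsigned char)p[j]];  // child with P[j]=T[start(w)]
        if (w == NULL) return -1;        // no match
        int i = w->start;
        int x = w->end - i + 1;          // length of the node label
        if (m <= x)                      // remaining pattern fits within label?
            return strncmp(p + j, t + i, m) == 0 ? i - j : -1;
        if (strncmp(p + j, t + i, x) != 0)
            return -1;                   // no match
        j += x; m -= x;                  // consume the label
        v = w;                           // move down one level
    }
}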
Text Compression
Problem: efficiently encode a given string X by a smaller string Y
Applications: e.g. reducing storage space and transmission time
Prefix code … binary code such that no code word is a prefix of another code word
Encoding tree … binary tree whose leaves are the characters, with each character's code word given by the path from the root (e.g. left edge = 0, right edge = 1)
Goal: given a text T, find a prefix code that yields the shortest encoding of T
Example: the same text encoded with two different prefix codes:

01011011010000101001011011010 vs 001011000100001100101100
Huffman Code
Huffman's algorithm builds an optimal encoding tree bottom-up, by repeatedly joining the two partial trees with the smallest frequencies
Example: abracadabra
HuffmanCode(T):
| Input  string T of size n
| Output optimal encoding tree for T
|
| compute frequency array for T
| Q=new priority queue
| for all characters c do
| |  Tc=new single-node tree storing c
| |  join(Q,Tc) with frequency(c) as key
| end for
| while |Q|≥2 do
| |  f1=Q.minKey(), T1=leave(Q)       // leave removes the item with minimal key
| |  f2=Q.minKey(), T2=leave(Q)
| |  Tm=new tree node with subtrees T1 and T2
| |  join(Q,Tm) with f1+f2 as key
| end while
| return leave(Q)
Exercise: construct a Huffman tree for "a fast runner need never be afraid of the dark"
Analysis:
• O(n+d·log d) time
  – n … length of the input text T
  – d … number of distinct characters in T
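A C sketch of the tree construction. Where the pseudocode uses a priority queue (join/leave), this sketch repeatedly scans an array for the two minimum-key trees, which is O(d²) overall but fine for small alphabets; all names are my own:

#include <stdlib.h>

typedef struct HNode {
    unsigned char c;                   // character (meaningful at leaves)
    int freq;                          // frequency key
    struct HNode *left, *right;        // NULL at leaves
} HNode;

static HNode *newNode(int c, int freq, HNode *l, HNode *r) {
    HNode *n = malloc(sizeof(HNode));
    n->c = (unsigned char)c; n->freq = freq; n->left = l; n->right = r;
    return n;
}

// Build an encoding tree for text; returns NULL for the empty string.
HNode *huffmanTree(const char *text) {
    int freq[256] = {0};                       // frequency array
    for (const unsigned char *p = (const unsigned char *)text; *p; p++)
        freq[*p]++;

    HNode *q[256];                             // "queue" of partial trees
    int size = 0;
    for (int c = 0; c < 256; c++)
        if (freq[c] > 0)
            q[size++] = newNode(c, freq[c], NULL, NULL);

    while (size >= 2) {
        int a = 0, b = 1;                      // indices of two minimum keys
        if (q[b]->freq < q[a]->freq) { a = 1; b = 0; }
        for (int k = 2; k < size; k++) {
            if      (q[k]->freq < q[a]->freq) { b = a; a = k; }
            else if (q[k]->freq < q[b]->freq) { b = k; }
        }
        // join the two minimum trees under a node with key f1+f2
        HNode *t = newNode(0, q[a]->freq + q[b]->freq, q[a], q[b]);
        if (a > b) { int tmp = a; a = b; b = tmp; }
        q[a] = t;                              // put joined tree back
        q[b] = q[size - 1];                    // remove the second tree
        size--;
    }
    return size > 0 ? q[0] : NULL;
}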
Approximation
Approximation for Numerical Problems
Approximation is often used to solve numerical problems by computing a result that is accurate to within some small ε, rather than exactly.
Examples:
• roots of a function f
• length of a curve determined by a function f
• … and many more
bisection(f,x1,x2):
| Input  function f, interval [x1,x2]
| Output x∈[x1,x2] with f(x)≅0
|
| repeat
| |  mid=(x1+x2)/2
| |  if f(x1)*f(mid)<0 then
| |  |  x2=mid                       // root to the left of mid
| |  else
| |  |  x1=mid                       // root to the right of mid
| |  end if
| until f(mid)=0 or x2-x1<ε          // ε … desired accuracy
| return mid

Bisection is guaranteed to converge to a root if f is continuous on [x1,x2] and f(x1) and f(x2) have opposite signs.
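A direct C rendering (a minimal sketch; the function-pointer parameter and the name eps are my own choices):

// Find x in [x1,x2] with f(x) ≈ 0; assumes f is continuous and
// f(x1), f(x2) have opposite signs. eps is the desired accuracy.
double bisection(double (*f)(double), double x1, double x2, double eps) {
    double mid;
    do {
        mid = (x1 + x2) / 2.0;
        if (f(x1) * f(mid) < 0)
            x2 = mid;                    // root to the left of mid
        else
            x1 = mid;                    // root to the right of mid
    } while (f(mid) != 0 && x2 - x1 >= eps);
    return mid;
}

For example, bisection(cos, 0.0, 3.0, 1e-9) approximates π/2, since cos is continuous and changes sign on [0,3] (include math.h for cos).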
Approximating the length of the curve defined by f over [start,end] with a fixed number of steps:

length=0, δ=(end-start)/steps
for each x∈{start+δ, start+2δ, …, end} do
|  length = length + sqrt(δ² + (f(x)-f(x-δ))²)
end for
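The same computation as a C function (a sketch; the parameter steps is the assumed number of straight-line segments):

#include <math.h>

// Approximate the length of the curve y=f(x) over [start,end]
// by summing the lengths of `steps` straight-line segments.
double curveLength(double (*f)(double), double start, double end, int steps) {
    double delta = (end - start) / steps;
    double length = 0.0;
    for (int k = 1; k <= steps; k++) {
        double x = start + k * delta;
        double dy = f(x) - f(x - delta);
        length += sqrt(delta * delta + dy * dy);
    }
    return length;
}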
Approximation for NP-hard Problems
Approximation is often used for NP-hard problems to obtain, in polynomial time, solutions that are provably close to optimal.
Examples: vertex cover (below), and many more
Vertex Cover
Reminder: Graph G = (V,E)
• set of vertices V
• set of edges E
Vertex cover C of G …
• C ⊆ V
• for all edges (u,v) ∈ E, either v ∈ C or u ∈ C (or both)
Applications:
Theorem.
Determining whether a graph has a vertex cover of a given size k is an NP-complete problem.
approxVertexCover(G):
| Input  undirected graph G=(V,E)
| Output vertex cover of G
|
| C=∅
| unusedE=E
| while unusedE≠∅ do
| |  choose any (v,w)∈unusedE
| |  C=C∪{v,w}
| |  unusedE=unusedE\{all edges incident on v or w}
| end while
| return C
Theorem.
The approximation algorithm returns a vertex cover at most twice the size of an optimal cover.
Proof. The edges chosen by the algorithm share no endpoints, so any (optimal) cover must contain at least one endpoint of each chosen edge; the algorithm adds both endpoints, hence the returned cover is at most twice the optimal size.
Cost analysis … each edge is examined at most once, so the algorithm runs in O(V+E) time with a suitable graph representation
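A C sketch over an edge-list representation (names are my own). Processing edges in order makes the edge-removal step implicit: any later edge incident on a chosen vertex is already covered and is skipped:

#include <stdbool.h>
#include <string.h>

typedef struct { int v, w; } Edge;

// 2-approximation for vertex cover: sets inCover[u]=true for chosen
// vertices. inCover must have room for nV flags.
void approxVertexCover(const Edge edges[], int nE, int nV, bool inCover[]) {
    memset(inCover, 0, nV * sizeof(bool));
    for (int i = 0; i < nE; i++) {
        int v = edges[i].v, w = edges[i].w;
        if (!inCover[v] && !inCover[w]) {    // edge still uncovered?
            inCover[v] = true;               // C = C ∪ {v,w}
            inCover[w] = true;
        }
    }
}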
Summary
• Alphabets and words
• Pattern matching
  – Boyer-Moore, Knuth-Morris-Pratt
• Tries
• Text compression
  – Huffman code
• Approximation
  – numerical problems
  – vertex cover
• Suggested reading:
  – tries … Sedgewick, Ch. 15.2
  – approximation … Moffat, Ch. 9.4