Short Notes On Knuth
1. Purpose:
The KMP algorithm is a pattern-matching algorithm used to find an occurrence of a pattern P of length
m in a text T of length n. It avoids redundant comparisons, achieving a time complexity of
O(n + m).
2. Key Idea:
Instead of starting over after a mismatch, the algorithm uses information from the pattern itself to skip
unnecessary comparisons. This is done using a failure function.
3. Failure Function:
o The failure function f(j) for a pattern P represents the length of the longest prefix of P that
is also a suffix of P[1..j]. P is indexed from 0, so the suffix never covers all of P[0..j].
o Example for P = "abacab":
    j:     0  1  2  3  4  5
    P[j]:  a  b  a  c  a  b
    f(j):  0  0  1  0  1  2
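The definition can be checked directly with a brute-force sketch. The helper name failureByDefinition is hypothetical (it does not come from these notes), and the code assumes 0-based indexing. It is deliberately slow, O(m^3), and is only meant for verifying small tables such as the one above; it is not the O(m) preprocessing step of KMP.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: computes f(j) straight from the definition.
// f(j) = length of the longest prefix of P that is also a suffix of P[1..j],
// i.e. the longest proper prefix of P[0..j] that is also a suffix of P[0..j].
std::vector<int> failureByDefinition(const std::string& P) {
    const int m = static_cast<int>(P.size());
    std::vector<int> f(m, 0);
    for (int j = 0; j < m; ++j) {
        // Try candidate lengths from the longest allowed (j) down to 1.
        // Length j + 1 is excluded because the suffix must start at index 1 or later.
        for (int len = j; len >= 1; --len) {
            // Compare the prefix P[0..len-1] with the suffix P[j-len+1..j].
            if (P.compare(0, len, P, j - len + 1, len) == 0) {
                f[j] = len;
                break;  // first hit is the longest, so stop here
            }
        }
    }
    return f;
}
```

Calling failureByDefinition("abacab") yields 0, 0, 1, 0, 1, 2, which agrees with the table.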
4. Algorithm Steps:
o Preprocessing: Compute the failure function f for P in O(m) time.
o Matching: Use f to determine how far to shift the pattern after a mismatch, reducing unnecessary
comparisons.
5. Performance:
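o Worst-case time complexity: O(n + m).
o This is asymptotically optimal in the worst case, since a matcher may have to read every character of T
and P. KMP never moves backward in T, and the matching phase makes at most 2n character comparisons.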
6. Advantages:
o Efficient for large T and P.
o Reduces the need to recheck previously matched characters.
7. C++ Implementation:
The algorithm can be implemented in C++ using two functions: one for matching (KMPMatch) and another
for computing the failure function (computeFailFunction).
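A minimal sketch of these two functions follows. Only the names KMPMatch and computeFailFunction come from the notes; the signatures, the -1 return value for "not found", and the treatment of an empty pattern are assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// Preprocessing, O(m): f[i] is the length of the longest prefix of P
// that is also a suffix of P[1..i] (0-based indexing).
std::vector<int> computeFailFunction(const std::string& P) {
    const int m = static_cast<int>(P.size());
    std::vector<int> f(m, 0);
    int i = 1;  // current position in P
    int j = 0;  // length of the prefix matched so far
    while (i < m) {
        if (P[i] == P[j]) {
            f[i] = j + 1;      // the matched prefix grows by one character
            ++i;
            ++j;
        } else if (j > 0) {
            j = f[j - 1];      // fall back to the next shorter candidate prefix
        } else {
            f[i] = 0;          // no prefix matches at this position
            ++i;
        }
    }
    return f;
}

// Matching, O(n): returns the index in T where the first occurrence of P
// starts, or -1 if P does not occur (assumed convention).
int KMPMatch(const std::string& T, const std::string& P) {
    const int n = static_cast<int>(T.size());
    const int m = static_cast<int>(P.size());
    if (m == 0) return 0;  // assumption: the empty pattern matches at index 0
    const std::vector<int> f = computeFailFunction(P);
    int i = 0;  // position in T, never decreases
    int j = 0;  // position in P
    while (i < n) {
        if (T[i] == P[j]) {
            if (j == m - 1) return i - m + 1;  // whole pattern matched
            ++i;
            ++j;
        } else if (j > 0) {
            j = f[j - 1];  // shift the pattern, keep the matched prefix
        } else {
            ++i;           // mismatch at the first pattern character
        }
    }
    return -1;
}
```

For example, KMPMatch("abacaabaccabacabaabb", "abacab") returns 10, the index where the pattern first occurs. The index i into T is never decremented, which is where the linear running time comes from.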
Q2: Define the failure function f(j) and explain its significance in the KMP algorithm.
Answer:
The failure function f(j) is defined as the length of the longest prefix of P that is also a suffix of
P[1..j]. It helps the KMP algorithm efficiently shift the pattern P in the text T after a mismatch,
ensuring no redundant comparisons are made. It encodes information about repeated substrings within the pattern.
Let's compute the failure function f(j) for the pattern P = "ababaca" step by step. The failure function
f(j) represents the length of the longest prefix of P that is also a suffix of P[1..j].
Pattern P:
P = "a b a b a c a"
Steps for f(j):
1. Initialization:
o f(0) = 0 (by definition).
o Start with i = 1 (current position) and j = 0 (length of the prefix matched so far).
2. Step-by-Step Calculation:
o i = 1:
P[1] = b ≠ P[0] = a, so f(1) = 0.
No prefix matches a proper suffix of P[0..1] = "ab".
o i = 2:
P[2] = a = P[0], so f(2) = 1.
Prefix "a" matches a suffix of P[0..2] = "aba".
o i = 3:
P[3] = b = P[1], so f(3) = 2.
Prefix "ab" matches a suffix of P[0..3] = "abab".
o i = 4:
P[4] = a = P[2], so f(4) = 3.
Prefix "aba" matches a suffix of P[0..4] = "ababa".
o i = 5:
P[5] = c ≠ P[3] = b, so we fall back to j = f(2) = 1.
P[5] = c ≠ P[1] = b, so we fall back to j = f(0) = 0.
P[5] = c ≠ P[0] = a, so f(5) = 0.
No prefix matches a proper suffix of P[0..5] = "ababac".
o i = 6:
P[6] = a = P[0], so f(6) = 1.
Prefix "a" matches a suffix of P[0..6] = "ababaca".
Result: f(0..6) = 0, 0, 1, 2, 3, 0, 1.
Explanation:
f(j) gives us the information to skip unnecessary comparisons in the Knuth-Morris-Pratt algorithm when
there's a mismatch during pattern matching.