The Knuth–Morris–Pratt (KMP) algorithm is an efficient string-matching algorithm used
to search for occurrences of a "pattern" string within a "text" string. It was developed by
Donald Knuth, Vaughan Pratt, and James H. Morris and published in 1977. The key idea behind the KMP
algorithm is to preprocess the pattern so that, after a mismatch, text characters that have
already been matched are never compared again.
Key Concepts
1. Partial Match Table (LPS Array):
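o The Longest Prefix Suffix (LPS) array is built in the preprocessing step of KMP.
o For each position i in the pattern, it stores the length of the longest proper prefix of
Pattern[0..i] that is also a suffix of Pattern[0..i]. A proper prefix is any prefix
shorter than the whole string.
o After a mismatch, the LPS array tells the algorithm how many pattern characters are
already known to match, so those comparisons are skipped.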
2. String Matching:
o While comparing the pattern with the text, if a mismatch occurs, the algorithm
uses the LPS array to determine the next position in the pattern to resume
matching.
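To make the LPS definition above concrete, here is a minimal sketch that computes the table directly from the definition by trying every candidate length. The function name lps_by_definition and the sample pattern "AABAAB" are illustrative assumptions, and this brute-force version is far slower than the construction KMP actually uses; it is only meant to show what each entry means.

```python
def lps_by_definition(pattern: str) -> list[int]:
    """Brute-force LPS table, computed straight from the definition (illustration only)."""
    table = []
    for i in range(len(pattern)):
        prefix = pattern[: i + 1]  # the part of the pattern ending at position i
        best = 0
        # A proper prefix is shorter than the whole string, so k stops at len(prefix) - 1.
        for k in range(1, len(prefix)):
            if prefix[:k] == prefix[-k:]:
                best = k  # later (longer) hits overwrite earlier ones
        table.append(best)
    return table


print(lps_by_definition("AABAAB"))  # [0, 1, 0, 1, 2, 3]
```

For example, the last entry is 3 because "AAB" is both a proper prefix and a suffix of "AABAAB".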
Steps of the Algorithm
1. Preprocessing:
o Construct the LPS array for the pattern string.
2. Search:
o Compare the pattern with the text.
o If there is a mismatch, use the LPS array to decide how far to shift the pattern.
o Repeat until the entire text is scanned or the pattern is found.
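The two steps above can be sketched in Python as follows. The names build_lps and kmp_search are illustrative, not from any library. Two choices here are assumptions rather than part of the description above: the search keeps going after a hit and returns every start index instead of stopping at the first one, and an empty pattern reports no matches.

```python
def build_lps(pattern: str) -> list[int]:
    """Step 1: build the LPS array in time linear in the pattern length."""
    lps = [0] * len(pattern)
    length = 0  # length of the current longest prefix that is also a suffix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length > 0:
            length = lps[length - 1]  # try a shorter prefix; i does not advance
        else:
            lps[i] = 0
            i += 1
    return lps


def kmp_search(text: str, pattern: str) -> list[int]:
    """Step 2: return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []  # assumption: an empty pattern reports no matches
    lps = build_lps(pattern)
    matches = []
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]  # mismatch: fall back using the LPS array
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = lps[j - 1]  # continue, allowing overlapping matches
    return matches


print(kmp_search("ABABDABACDABABCABAB", "ABABCABAB"))  # [10]
print(kmp_search("AAAA", "AA"))  # [0, 1, 2]
```

Note that the text index i only ever moves forward; all backtracking happens on the pattern index j.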
Example
Given:
Text: "ABABDABACDABABCABAB"
Pattern: "ABABCABAB"
Step 1: Build the LPS Array
The LPS array is constructed as follows:
Index (i)   Pattern[i]   LPS[i]
0           A            0
1           B            0
2           A            1
3           B            2
4           C            0
5           A            1
6           B            2
7           A            3
8           B            4
Step 2: Pattern Matching
Now compare the pattern "ABABCABAB" with the text "ABABDABACDABABCABAB":
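Matching starts at text index 0. Compare each character:
o ABAB matches → mismatch at text index 4 (text: D, pattern: C).
o LPS[3] = 2, so the pattern shifts to align its prefix AB with the AB just matched.
D does not match there either, so the pattern falls back to its start and the text
advances.
Matching resumes at text index 5:
o ABA matches → mismatch at text index 8 (text: C, pattern: B).
o Falling back through LPS[2] = 1 and LPS[0] = 0 finds no match for C, and the D at
text index 9 does not match A either.
Matching resumes at text index 10:
o ABABCABAB matches in full, so the pattern is found at text index 10.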
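The walkthrough can be checked with a small tracing sketch. The function kmp_trace and its message format are hypothetical, written only for this example. It takes the LPS values from the table in Step 1 as an argument, and it records a line only when the pattern falls back through the LPS array or a full match is found; mismatches at pattern position 0 just advance the text and are not logged.

```python
def kmp_trace(text: str, pattern: str, lps: list[int]) -> list[str]:
    """Run the KMP search and record each LPS fallback and each full match."""
    events = []
    i = j = 0  # i indexes the text, j indexes the pattern
    while i < len(text):
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == len(pattern):
                events.append(f"match at text index {i - j}")
                j = lps[j - 1]
        elif j > 0:
            events.append(
                f"mismatch at text index {i} "
                f"({text[i]!r} vs {pattern[j]!r}): j {j} -> {lps[j - 1]}"
            )
            j = lps[j - 1]  # reuse the already matched prefix; i stays put
        else:
            i += 1  # nothing matched yet, move on in the text
    return events


lps = [0, 0, 1, 2, 0, 1, 2, 3, 4]  # LPS values from the table in Step 1
for line in kmp_trace("ABABDABACDABABCABAB", "ABABCABAB", lps):
    print(line)
# mismatch at text index 4 ('D' vs 'C'): j 4 -> 2
# mismatch at text index 4 ('D' vs 'A'): j 2 -> 0
# mismatch at text index 8 ('C' vs 'B'): j 3 -> 1
# mismatch at text index 8 ('C' vs 'B'): j 1 -> 0
# match at text index 10
```

The text index in the log never decreases, which is the property that gives KMP its linear running time.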
Advantages
Efficient: Time complexity is O(m + n), where m is the length of
the pattern and n is the length of the text.
No backtracking in the text: the text is scanned once from left to right, so the search
phase makes at most 2n character comparisons.
Applications
Pattern matching in text processing.
Finding substrings.
Plagiarism detection.
DNA sequence analysis.